EPPS 6302 Methods of Data Collection and Production

Assignment 4: Webscraping

  • To complete Assignment 4, start by using rvest_wiki01.R to scrape foreign reserve data from Wikipedia, modifying the code as needed to capture additional tables. Then search for government documents on govinfo.gov and use govtdata01.R to download ten documents (a download sketch follows this list).
  • In the report, describe any challenges encountered in the scraping process, such as variations in table structures on Wikipedia, anti-scraping measures on government websites, or issues with dynamic content loading. Evaluate the usability of the scraped data, noting any limitations like incomplete or inconsistent data.
  • For improvement, consider using proxy rotation to bypass anti-bot mechanisms, enhancing data-cleaning techniques to ensure consistency, and adopting more robust scraping tools to better handle complex or changing website structures.
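
A minimal sketch of the document-download step is below. The URLs are placeholders, not real document links; the actual links come from your govinfo.gov search results and from govtdata01.R.

# Hypothetical example: download a set of PDFs located via a govinfo.gov search
doc_urls <- c(
  "https://www.govinfo.gov/content/pkg/EXAMPLE-1/pdf/EXAMPLE-1.pdf",
  "https://www.govinfo.gov/content/pkg/EXAMPLE-2/pdf/EXAMPLE-2.pdf"
)
dir.create("govdocs", showWarnings = FALSE)
for (u in doc_urls) {
  download.file(u, destfile = file.path("govdocs", basename(u)), mode = "wb")  # "wb" for binary files
  Sys.sleep(2)  # pause between requests to stay polite to the server
}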
## Workshop: Scraping webpages with R rvest package
# Prerequisites: Chrome browser, Selector Gadget

# install.packages("tidyverse")
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#install.packages("rvest")
library(rvest)

Attaching package: 'rvest'
The following object is masked from 'package:readr':

    guess_encoding
url <- 'https://en.wikipedia.org/wiki/List_of_countries_by_foreign-exchange_reserves'
#Reading the HTML code from the Wiki website
wikiforreserve <- read_html(url)
class(wikiforreserve)
[1] "xml_document" "xml_node"    
## Get the XPath using the Inspect Element feature in Safari, Chrome, or Firefox.
## In the Inspect (Elements) panel, look for the <table class=...> tag; the table element can stay collapsed.
## Right-click the table element, choose Copy --> XPath, and paste it into html_nodes(xpath = ).

foreignreserve <- wikiforreserve %>%
  html_nodes(xpath='//*[@id="mw-content-text"]/div[1]/table[1]') %>%
  html_table()
class(foreignreserve) # Why is the first column not scraped?
[1] "list"
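## Alternative to copying the XPath: use a CSS selector (e.g., found with SelectorGadget).
## This is a sketch and assumes the reserves table is the first "wikitable" on the page.
foreignreserve_css <- wikiforreserve %>%
  html_element("table.wikitable") %>%
  html_table()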
fores <- foreignreserve[[1]][, 1:8] # [[ ]] returns a single element directly, without retaining the list structure.


# Rename the columns with shorter, code-friendly names
names(fores) <- c("Country", "Forexreswithgold", "Date1", "Change1","Forexreswithoutgold", "Date2","Change2", "Sources")
colnames(fores)
[1] "Country"             "Forexreswithgold"    "Date1"              
[4] "Change1"             "Forexreswithoutgold" "Date2"              
[7] "Change2"             "Sources"            
head(fores$Country, n=10)
 [1] "China"        "Japan"        "Switzerland"  "India"        "Russia"      
 [6] "Taiwan"       "Saudi Arabia" "Hong Kong"    "South Korea"  "Singapore"   
# Is the Sources column useful?
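# If not, one option is to drop it (a sketch; fores_trimmed is a new object, the original fores is kept)
fores_trimmed <- select(fores, -Sources)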

## Clean up variables
## What type is Date?

# Convert Date1 variable
fores$Date1 = as.Date(fores$Date1, format = "%d %b %Y")
class(fores$Date1)
[1] "Date"
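# The reserve columns are typically scraped as character strings with commas and
# footnote markers; a minimal cleaning sketch, assuming they came through as character:
fores$Forexreswithgold    <- readr::parse_number(fores$Forexreswithgold)
fores$Forexreswithoutgold <- readr::parse_number(fores$Forexreswithoutgold)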
write.csv(fores, "fores.csv", row.names = FALSE) # or use data.table::fwrite()?

# Second table on the same page (currency columns)

foreignreserve1 <- wikiforreserve %>%
  html_nodes(xpath='//*[@id="mw-content-text"]/div[1]/table[2]') %>%
  html_table()
class(foreignreserve1) # html_table() again returns a list
[1] "list"
fores1 <- foreignreserve1[[1]][, c("USD", "EUR", "JPY", "GBP", "CAD", "RMB", "AUD", "CHF")] # [[ ]] returns a single element directly, without retaining the list structure.


# Columns were selected by name above, so this rename keeps the same names
names(fores1) <- c("USD", "EUR", "JPY", "GBP", "CAD", "RMB", "AUD", "CHF")
colnames(fores1)
[1] "USD" "EUR" "JPY" "GBP" "CAD" "RMB" "AUD" "CHF"
write.csv(fores1, "fores1.csv", row.names = FALSE) # or use data.table::fwrite()?
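
# For the "additional tables" part of the assignment, every wikitable on the page can
# also be pulled in one call and then inspected by position (a sketch):
all_tables <- wikiforreserve %>%
  html_elements("table.wikitable") %>%
  html_table()
length(all_tables)      # how many tables were captured
# str(all_tables[[2]])  # inspect an individual table before cleaning it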